In my work I used the open source H2O R package for PCA. The H2O package is actually just a REST client for an H2O cluster. Used locally, it can start up a ‘local cluster’, which can then be accessed through the Python, R, Java and Scala APIs or through the web UI (Flow).
For the interactive visualizations I used Plotly’s R library. Plotly also has APIs for other languages, for example Python, MATLAB, Scala and Julia (and JavaScript natively).
knitr::opts_chunk$set(echo = TRUE)
library(plotly)
library(h2o)
h2o.init() #local initialization
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 days 7 hours
## H2O cluster version: 3.10.4.6
## H2O cluster version age: 1 month and 10 days
## H2O cluster name: H2O_started_from_R_Matty_oyu065
## H2O cluster total nodes: 1
## H2O cluster total memory: 2.91 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.3.2 (2016-10-31)
To work with the data, we first have to read it and “upload” it to the cluster. H2O can load data from Amazon S3, Hadoop HDFS, SQL and NoSQL stores, but here we just read from the local filesystem.
data <- read.csv("usps_ch.txt",header = FALSE)
data <- data[, 1:(ncol(data) - 1)] # drop the empty last column: each line ends with a ','
lab <- read.csv("usps_val.txt",header = FALSE)
colnames(lab) <- c('lab')
labeled_data <- cbind(data,lab)
#write.csv(data,"usps.csv",append = FALSE,col.names = TRUE,quote = FALSE,row.names = FALSE)
usps_data <- as.h2o(data,'usps')
##
|
| | 0%
|
|=================================================================| 100%
When we work with the ‘cluster’, it reports the percentage of completion of each task as feedback, which is then ‘knitted’ (rendered) into the document.
The digits are hardly recognizable, and it seems that they are upside down (or I made some mistake in filling the matrices). Anyway, it shouldn’t cause any problem.
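The trailing-comma cleanup in the loading step can be checked on a tiny in-memory example. The two rows below are made up and only mimic the shape of the file; nothing here touches the real data or H2O. The sketch also shows a precedence pitfall worth knowing when writing such an index: `:` binds tighter than `-`, so `1:n - 1` means `(1:n) - 1`.

```r
# Two made-up rows in the same shape as the file: every line ends with ','
txt <- "0.1,0.2,0.3,\n0.4,0.5,0.6,"
toy <- read.csv(text = txt, header = FALSE)

n_before <- ncol(toy)          # 4: the trailing comma yields an extra, all-NA column

# Precedence: ':' is evaluated before '-'
idx_loose <- 1:n_before - 1    # 0 1 2 3 (an index of 0 is silently ignored by `[`)
idx_exact <- 1:(n_before - 1)  # 1 2 3

toy <- toy[, idx_exact]        # keep only the real value columns
```

Both index vectors select the same columns here, but only the parenthesised form says what is meant.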
image(matrix(unlist(data[4235,]),nrow=16,ncol=16), col = gray(0:64 / 64))
image(matrix(unlist(data[7638,]),nrow=16,ncol=16), col = gray(0:64 / 64))
image(matrix(unlist(data[2681,]),nrow=16,ncol=16), col = gray(0:64 / 64))
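A likely reason for the flipped digits is the way base R's `image()` lays out a matrix: rows run along the x axis and column 1 is drawn at the bottom. The sketch below is a hypothetical helper (`upright` is not part of the original code) that reorders the pixels so the picture is drawn the right way up. It assumes that the file stores each digit row by row, which I have not verified.

```r
# Hypothetical helper: arrange a row-by-row pixel vector so that image()
# draws it upright. Assumes the pixels are stored row by row (unverified).
upright <- function(pixels, side = 16) {
  m <- matrix(pixels, nrow = side, ncol = side, byrow = TRUE)  # m[r, c]: picture row r, column c
  t(m[side:1, ])  # image() wants z[x, y] with y = 1 at the bottom
}

# Tiny 2x2 check: top row is (1, 2), bottom row is (3, 4)
z <- upright(c(1, 2,
               3, 4), side = 2)
# z[1, 1] is the bottom-left pixel (3), z[1, 2] the top-left pixel (1)
```

With the real data this could be used as `image(upright(unlist(data[4235, ])), col = gray(0:64 / 64))`.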
For creating the PCA model from an H2O data frame we can set several parameters. I mostly kept the defaults and only asked for all 256 components (k = 256).
wholeres<-h2o.prcomp(training_frame=usps_data, x=1:256, k=256, compute_metrics = TRUE)
##
|
| | 0%
|
|==================================================== | 80%
|
|=================================================================| 100%
params <- as.data.frame(wholeres@model$importance)
cumul_prop <- params["Cumulative Proportion",]
row.names(cumul_prop) <- c("all")
allres <- data.frame(x=1:length(cumul_prop["all",]),cumprop=t(cumul_prop["all",]),label="all",stringsAsFactors=FALSE)
par <- params["Cumulative Proportion",]
row.names(par) <- c("all")
allnum <- data.frame(x=1:length(par["all",]),y=t(par["all",]),label="Cumulative Proportion")
par <- params["Proportion of Variance",]
row.names(par) <- c("all")
allnum <- rbind(allnum,data.frame(x=1:length(par["all",]),y=t(par["all",]),label="Proportion of Variance"))
par <- params["Standard deviation",]
row.names(par) <- c("all")
allnum <- rbind(allnum,data.frame(x=1:length(par["all",]),y=logb(t(par["all",]),base = 10),label="Log Standard Deviation"))
plot_ly(allnum,x=~x, y=~all, color = ~label) %>% add_lines()
image(matrix(wholeres@model$eigenvectors$pc1,nrow=16), col = gray(0:32 / 32))
From the first plot we can see that the proportion of variance falls drastically after the first few components. The standard deviations of the last 37 components (pc220 onward) are smaller than 1, so they carry practically no information.
By plotting just the first principal component we can see that the most relevant pixels are in the middle, while the ones in the corners don’t have any variance; they are just black…
We can see in the images which pixels are most ‘relevant’ for a specific digit.
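How quickly the variance is used up can also be read off numerically instead of from the plot. A minimal sketch, using the first twelve cumulative proportions copied from the model summary (with the full model, the same vector could presumably be taken from `unlist(params["Cumulative Proportion", ])`):

```r
# First twelve "Cumulative Proportion" values, copied from summary(wholeres)
cum_prop <- c(0.629631, 0.713442, 0.760455, 0.789856, 0.817372, 0.838649,
              0.855254, 0.869373, 0.881799, 0.892462, 0.902590, 0.911370)
names(cum_prop) <- paste0("pc", seq_along(cum_prop))

# Smallest number of components whose cumulative proportion reaches 90%
n90 <- which(cum_prop >= 0.90)[1]   # pc11
```

So 11 of the 256 components already explain about 90% of the variance.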
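for (i in 0:9) {
  # rows belonging to digit i, uploaded as their own H2O frame
  numdata <- labeled_data[labeled_data$lab == i, ]
  dfname <- paste('usps_', i, sep = "")
  part_data <- as.h2o(numdata, dfname)
  # PCA on this digit only; x = 1:256 selects the pixel columns, not the label
  pcares <- h2o.prcomp(training_frame = part_data, x = 1:256, k = 256, compute_metrics = TRUE, ignore_const_cols = FALSE)
  params2 <- as.data.frame(pcares@model$importance)
  c_prop <- params2["Cumulative Proportion", ]
  row.names(c_prop) <- c("all")
  cumul_prop <- rbind(cumul_prop, c_prop)
  # append this digit's cumulative-proportion curve to the long table used for plotting
  numres <- data.frame(x = 1:length(c_prop["all", ]), cumprop = t(c_prop["all", ]), label = dfname)
  allres <- rbind(allres, numres)
  # first principal component of digit i as a 16x16 image
  image(matrix(pcares@model$eigenvectors$pc1, nrow = 16), col = gray(0:32 / 32))
}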
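The per-digit loop follows a split, fit, collect pattern. The sketch below shows the same pattern on made-up data with base R only: `prcomp` stands in for `h2o.prcomp`, and `toy`, `curves` and the `toy_` labels are hypothetical names. It is not a drop-in replacement for the H2O code, just an illustration of the structure.

```r
set.seed(7)
# Made-up data: 60 rows, 4 "pixel" columns, 3 classes of 20 rows each
toy <- data.frame(matrix(runif(60 * 4), ncol = 4), lab = rep(0:2, each = 20))

# One PCA per class, collected into a long table (x, cumprop, label)
curves <- do.call(rbind, lapply(split(toy, toy$lab), function(part) {
  fit <- prcomp(part[, 1:4], center = FALSE)     # base R stand-in for h2o.prcomp
  cum <- cumsum(fit$sdev^2) / sum(fit$sdev^2)    # cumulative proportion of variance
  data.frame(x = seq_along(cum), cumprop = cum,
             label = paste0("toy_", part$lab[1]))
}))
```

Collecting the pieces with `lapply` and a single `rbind` avoids growing a data frame inside the loop, which matters little for ten digits but keeps the code shorter.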
Individual digits (lines) can be hidden by clicking on their labels in the plot legend.
plot_ly(allres,x=~x,y=~all,color=~label)%>%add_lines()
summary(wholeres)
## Model Details:
## ==============
##
## H2ODimReductionModel: pca
## Model Key: PCA_model_R_1496656659453_62
## Importance of components:
## pc1 pc2 pc3 pc4
## Standard deviation 1274.244990 464.901006 348.192752 275.355291
## Proportion of Variance 0.629631 0.083811 0.047013 0.029401
## Cumulative Proportion 0.629631 0.713442 0.760455 0.789856
## pc5 pc6 pc7 pc8
## Standard deviation 266.379566 234.243032 206.933140 190.812392
## Proportion of Variance 0.027516 0.021277 0.016605 0.014119
## Cumulative Proportion 0.817372 0.838649 0.855254 0.869373
## pc9 pc10 pc11 pc12
## Standard deviation 179.007905 165.831496 161.606414 150.475354
## Proportion of Variance 0.012426 0.010664 0.010127 0.008780
## Cumulative Proportion 0.881799 0.892462 0.902590 0.911370
## pc13 pc14 pc15 pc16
## Standard deviation 133.837194 127.067289 120.947457 114.528384
## Proportion of Variance 0.006946 0.006261 0.005672 0.005086
## Cumulative Proportion 0.918316 0.924577 0.930250 0.935336
## pc17 pc18 pc19 pc20 pc21
## Standard deviation 111.146195 104.260473 99.559564 95.553429 94.014504
## Proportion of Variance 0.004790 0.004215 0.003844 0.003541 0.003427
## Cumulative Proportion 0.940126 0.944342 0.948185 0.951726 0.955153
## pc22 pc23 pc24 pc25 pc26
## Standard deviation 89.776840 82.961567 81.159907 78.549222 72.698557
## Proportion of Variance 0.003125 0.002669 0.002554 0.002393 0.002049
## Cumulative Proportion 0.958279 0.960948 0.963502 0.965894 0.967944
## pc27 pc28 pc29 pc30 pc31
## Standard deviation 70.951298 65.533636 64.894571 63.995619 62.489143
## Proportion of Variance 0.001952 0.001665 0.001633 0.001588 0.001514
## Cumulative Proportion 0.969896 0.971561 0.973194 0.974782 0.976297
## pc32 pc33 pc34 pc35 pc36
## Standard deviation 59.042552 57.364689 56.125212 54.101502 53.085600
## Proportion of Variance 0.001352 0.001276 0.001222 0.001135 0.001093
## Cumulative Proportion 0.977648 0.978925 0.980146 0.981281 0.982374
## pc37 pc38 pc39 pc40 pc41
## Standard deviation 50.503584 49.527218 47.773459 46.980314 46.053683
## Proportion of Variance 0.000989 0.000951 0.000885 0.000856 0.000822
## Cumulative Proportion 0.983363 0.984314 0.985199 0.986055 0.986877
## pc42 pc43 pc44 pc45 pc46
## Standard deviation 42.304105 41.730306 39.608456 38.828366 37.202680
## Proportion of Variance 0.000694 0.000675 0.000608 0.000585 0.000537
## Cumulative Proportion 0.987571 0.988247 0.988855 0.989440 0.989976
## pc47 pc48 pc49 pc50 pc51
## Standard deviation 36.165489 35.437391 34.839084 33.689939 32.468130
## Proportion of Variance 0.000507 0.000487 0.000471 0.000440 0.000409
## Cumulative Proportion 0.990484 0.990971 0.991441 0.991881 0.992290
## pc52 pc53 pc54 pc55 pc56
## Standard deviation 31.990247 30.801097 29.967805 29.027644 27.889147
## Proportion of Variance 0.000397 0.000368 0.000348 0.000327 0.000302
## Cumulative Proportion 0.992687 0.993055 0.993403 0.993730 0.994031
## pc57 pc58 pc59 pc60 pc61
## Standard deviation 27.050286 26.235012 25.742545 25.601171 25.132786
## Proportion of Variance 0.000284 0.000267 0.000257 0.000254 0.000245
## Cumulative Proportion 0.994315 0.994582 0.994839 0.995093 0.995338
## pc62 pc63 pc64 pc65 pc66
## Standard deviation 24.433313 23.275377 22.897983 22.142775 21.992126
## Proportion of Variance 0.000231 0.000210 0.000203 0.000190 0.000188
## Cumulative Proportion 0.995570 0.995780 0.995983 0.996173 0.996361
## pc67 pc68 pc69 pc70 pc71
## Standard deviation 20.856347 20.192318 19.807985 19.183744 18.714317
## Proportion of Variance 0.000169 0.000158 0.000152 0.000143 0.000136
## Cumulative Proportion 0.996529 0.996687 0.996840 0.996982 0.997118
## pc72 pc73 pc74 pc75 pc76
## Standard deviation 17.907908 17.396736 17.039673 16.953954 15.986280
## Proportion of Variance 0.000124 0.000117 0.000113 0.000111 0.000099
## Cumulative Proportion 0.997242 0.997360 0.997472 0.997584 0.997683
## pc77 pc78 pc79 pc80 pc81
## Standard deviation 15.842540 15.600678 15.408810 14.986807 14.906695
## Proportion of Variance 0.000097 0.000094 0.000092 0.000087 0.000086
## Cumulative Proportion 0.997780 0.997875 0.997967 0.998054 0.998140
## pc82 pc83 pc84 pc85 pc86
## Standard deviation 14.180931 13.997363 13.510558 13.199411 13.164774
## Proportion of Variance 0.000078 0.000076 0.000071 0.000068 0.000067
## Cumulative Proportion 0.998218 0.998294 0.998365 0.998432 0.998500
## pc87 pc88 pc89 pc90 pc91
## Standard deviation 12.706910 12.471315 12.294626 11.920947 11.405376
## Proportion of Variance 0.000063 0.000060 0.000059 0.000055 0.000050
## Cumulative Proportion 0.998562 0.998622 0.998681 0.998736 0.998787
## pc92 pc93 pc94 pc95 pc96
## Standard deviation 10.868350 10.755065 10.615904 10.447391 10.240022
## Proportion of Variance 0.000046 0.000045 0.000044 0.000042 0.000041
## Cumulative Proportion 0.998832 0.998877 0.998921 0.998963 0.999004
## pc97 pc98 pc99 pc100 pc101
## Standard deviation 10.039655 9.639609 9.619014 9.371075 9.049477
## Proportion of Variance 0.000039 0.000036 0.000036 0.000034 0.000032
## Cumulative Proportion 0.999043 0.999079 0.999115 0.999149 0.999181
## pc102 pc103 pc104 pc105 pc106
## Standard deviation 8.847642 8.520904 8.393524 8.275680 8.151704
## Proportion of Variance 0.000030 0.000028 0.000027 0.000027 0.000026
## Cumulative Proportion 0.999211 0.999239 0.999267 0.999293 0.999319
## pc107 pc108 pc109 pc110 pc111
## Standard deviation 7.932270 7.806701 7.544461 7.462464 7.317688
## Proportion of Variance 0.000024 0.000024 0.000022 0.000022 0.000021
## Cumulative Proportion 0.999343 0.999367 0.999389 0.999411 0.999431
## pc112 pc113 pc114 pc115 pc116
## Standard deviation 7.231821 7.056751 7.036919 6.809337 6.749605
## Proportion of Variance 0.000020 0.000019 0.000019 0.000018 0.000018
## Cumulative Proportion 0.999452 0.999471 0.999490 0.999508 0.999526
## pc117 pc118 pc119 pc120 pc121
## Standard deviation 6.717518 6.654124 6.454950 6.134173 6.045494
## Proportion of Variance 0.000017 0.000017 0.000016 0.000015 0.000014
## Cumulative Proportion 0.999543 0.999561 0.999577 0.999591 0.999605
## pc122 pc123 pc124 pc125 pc126
## Standard deviation 5.938174 5.866071 5.708103 5.527348 5.437435
## Proportion of Variance 0.000014 0.000013 0.000013 0.000012 0.000011
## Cumulative Proportion 0.999619 0.999632 0.999645 0.999657 0.999668
## pc127 pc128 pc129 pc130 pc131
## Standard deviation 5.401527 5.345411 5.304517 5.192840 5.111759
## Proportion of Variance 0.000011 0.000011 0.000011 0.000010 0.000010
## Cumulative Proportion 0.999680 0.999691 0.999702 0.999712 0.999722
## pc132 pc133 pc134 pc135 pc136
## Standard deviation 4.882674 4.840304 4.809149 4.694346 4.531503
## Proportion of Variance 0.000009 0.000009 0.000009 0.000009 0.000008
## Cumulative Proportion 0.999732 0.999741 0.999750 0.999758 0.999766
## pc137 pc138 pc139 pc140 pc141
## Standard deviation 4.460919 4.415453 4.356742 4.329604 4.198763
## Proportion of Variance 0.000008 0.000008 0.000007 0.000007 0.000007
## Cumulative Proportion 0.999774 0.999781 0.999789 0.999796 0.999803
## pc142 pc143 pc144 pc145 pc146
## Standard deviation 4.098713 4.075505 4.035855 4.000611 3.950898
## Proportion of Variance 0.000007 0.000006 0.000006 0.000006 0.000006
## Cumulative Proportion 0.999809 0.999816 0.999822 0.999828 0.999834
## pc147 pc148 pc149 pc150 pc151
## Standard deviation 3.891251 3.804353 3.739364 3.701704 3.593626
## Proportion of Variance 0.000006 0.000006 0.000005 0.000005 0.000005
## Cumulative Proportion 0.999840 0.999846 0.999851 0.999857 0.999862
## pc152 pc153 pc154 pc155 pc156
## Standard deviation 3.511938 3.476335 3.405915 3.314068 3.296975
## Proportion of Variance 0.000005 0.000005 0.000004 0.000004 0.000004
## Cumulative Proportion 0.999866 0.999871 0.999876 0.999880 0.999884
## pc157 pc158 pc159 pc160 pc161
## Standard deviation 3.265922 3.182625 3.139144 3.129682 3.080028
## Proportion of Variance 0.000004 0.000004 0.000004 0.000004 0.000004
## Cumulative Proportion 0.999888 0.999892 0.999896 0.999900 0.999903
## pc162 pc163 pc164 pc165 pc166
## Standard deviation 3.047589 2.993577 2.982730 2.936685 2.903852
## Proportion of Variance 0.000004 0.000003 0.000003 0.000003 0.000003
## Cumulative Proportion 0.999907 0.999910 0.999914 0.999917 0.999921
## pc167 pc168 pc169 pc170 pc171
## Standard deviation 2.821971 2.768194 2.692140 2.641267 2.600669
## Proportion of Variance 0.000003 0.000003 0.000003 0.000003 0.000003
## Cumulative Proportion 0.999924 0.999927 0.999929 0.999932 0.999935
## pc172 pc173 pc174 pc175 pc176
## Standard deviation 2.548103 2.515712 2.512119 2.441507 2.432208
## Proportion of Variance 0.000003 0.000002 0.000002 0.000002 0.000002
## Cumulative Proportion 0.999937 0.999940 0.999942 0.999944 0.999947
## pc177 pc178 pc179 pc180 pc181
## Standard deviation 2.371046 2.340110 2.316440 2.297229 2.266848
## Proportion of Variance 0.000002 0.000002 0.000002 0.000002 0.000002
## Cumulative Proportion 0.999949 0.999951 0.999953 0.999955 0.999957
## pc182 pc183 pc184 pc185 pc186
## Standard deviation 2.189303 2.165049 2.127722 2.114389 2.019621
## Proportion of Variance 0.000002 0.000002 0.000002 0.000002 0.000002
## Cumulative Proportion 0.999959 0.999961 0.999963 0.999964 0.999966
## pc187 pc188 pc189 pc190 pc191
## Standard deviation 2.017006 1.917275 1.904669 1.857218 1.846740
## Proportion of Variance 0.000002 0.000001 0.000001 0.000001 0.000001
## Cumulative Proportion 0.999968 0.999969 0.999970 0.999972 0.999973
## pc192 pc193 pc194 pc195 pc196
## Standard deviation 1.826634 1.785150 1.743534 1.718968 1.713053
## Proportion of Variance 0.000001 0.000001 0.000001 0.000001 0.000001
## Cumulative Proportion 0.999974 0.999976 0.999977 0.999978 0.999979
## pc197 pc198 pc199 pc200 pc201
## Standard deviation 1.658477 1.616345 1.567824 1.514959 1.499081
## Proportion of Variance 0.000001 0.000001 0.000001 0.000001 0.000001
## Cumulative Proportion 0.999980 0.999981 0.999982 0.999983 0.999984
## pc202 pc203 pc204 pc205 pc206
## Standard deviation 1.478476 1.448103 1.417590 1.381124 1.368978
## Proportion of Variance 0.000001 0.000001 0.000001 0.000001 0.000001
## Cumulative Proportion 0.999985 0.999985 0.999986 0.999987 0.999988
## pc207 pc208 pc209 pc210 pc211
## Standard deviation 1.322920 1.280988 1.246856 1.238697 1.219711
## Proportion of Variance 0.000001 0.000001 0.000001 0.000001 0.000001
## Cumulative Proportion 0.999988 0.999989 0.999990 0.999990 0.999991
## pc212 pc213 pc214 pc215 pc216
## Standard deviation 1.174494 1.159001 1.111435 1.086112 1.053414
## Proportion of Variance 0.000001 0.000001 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999991 0.999992 0.999992 0.999993 0.999993
## pc217 pc218 pc219 pc220 pc221
## Standard deviation 1.017867 1.011305 1.000389 0.973905 0.959337
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999994 0.999994 0.999994 0.999995 0.999995
## pc222 pc223 pc224 pc225 pc226
## Standard deviation 0.925202 0.867259 0.848138 0.831245 0.821670
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999995 0.999996 0.999996 0.999996 0.999997
## pc227 pc228 pc229 pc230 pc231
## Standard deviation 0.806934 0.790703 0.775839 0.724028 0.688933
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999997 0.999997 0.999997 0.999997 0.999998
## pc232 pc233 pc234 pc235 pc236
## Standard deviation 0.682407 0.670796 0.654977 0.633642 0.619227
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999998 0.999998 0.999998 0.999998 0.999998
## pc237 pc238 pc239 pc240 pc241
## Standard deviation 0.589804 0.572931 0.564169 0.539657 0.517071
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999999 0.999999 0.999999 0.999999 0.999999
## pc242 pc243 pc244 pc245 pc246
## Standard deviation 0.501699 0.492296 0.478275 0.464773 0.439308
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 0.999999 0.999999 0.999999 0.999999 1.000000
## pc247 pc248 pc249 pc250 pc251
## Standard deviation 0.423885 0.412680 0.397621 0.371710 0.343730
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 1.000000 1.000000 1.000000 1.000000 1.000000
## pc252 pc253 pc254 pc255 pc256
## Standard deviation 0.323664 0.312185 0.306788 0.285587 0.281602
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 1.000000 1.000000 1.000000 1.000000 1.000000
##
## H2ODimReductionMetrics: pca
##
## No model metrics available for PCA
##
##
##
## Scoring History for GramSVD:
## timestamp duration iteration
## 1 2017-06-06 20:33:37 0.763 sec 0
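The rows of the importance table are tied together: the proportion of variance of a component is its squared standard deviation divided by the sum of all squared standard deviations. A quick consistency check on two entries copied from the summary (this only illustrates the relationship, it does not recompute the model):

```r
# Values copied from summary(wholeres)
sdev <- c(pc1 = 1274.244990, pc2 = 464.901006)
prop <- c(pc1 = 0.629631,    pc2 = 0.083811)

# The ratio of variances (squared standard deviations) should match
# the ratio of the reported proportions of variance
var_ratio  <- unname((sdev["pc1"] / sdev["pc2"])^2)
prop_ratio <- unname(prop["pc1"] / prop["pc2"])
```

Both ratios come out at roughly 7.51, so the first component alone carries about seven and a half times the variance of the second.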
If we just want to inspect two classes, we can hide the others through the legend, as in the previous plot.
For example, if we look at 5 and 6, we can see that in 2D we can’t really separate the two classes, but in 3D the data looks more promising.
lb2 <- lapply(lab,as.character) #to have labels
proj <- data.matrix(data[,]) %*% cbind(wholeres@model$eigenvectors$pc1,wholeres@model$eigenvectors$pc2,wholeres@model$eigenvectors$pc3)
proj <- data.frame(x=proj[,1],y=proj[,2],z=proj[,3],label=lb2$lab)
plot_ly(proj,x=~x,y=~y,color =~label,type ='scatter',text=~label)%>% toWebGL() #2d
plot_ly(proj,x=~x,y=~y,z=~z,color =~label,type ='scatter3d',text=~label) %>% toWebGL() #3d
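The projection above is a plain matrix product of the raw pixel matrix with the first three eigenvectors, without centering. The sketch below reproduces that idea on made-up data with base R: `prcomp(center = FALSE)` is used as a stand-in for the H2O model (an assumption, not the H2O implementation), and `toy_proj` is a hypothetical name. It also shows filtering to one class in code rather than through the legend.

```r
set.seed(42)
# Toy stand-in for the digit data: 100 "images" with 6 "pixels", two made-up labels
x <- matrix(runif(100 * 6), ncol = 6)
label <- rep(c("5", "6"), each = 50)

# Without centering, the component scores are exactly x %*% eigenvectors
fit <- prcomp(x, center = FALSE, scale. = FALSE)
scores <- x %*% fit$rotation[, 1:3]

toy_proj <- data.frame(x = scores[, 1], y = scores[, 2], z = scores[, 3],
                       label = label)

# Keep a single class before plotting
only_5 <- toy_proj[toy_proj$label == "5", ]
```

A data frame filtered this way could be passed to `plot_ly` in the same way as `proj`.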